CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is a continuation-in-part of U.S. Pat. App. No. 09/186,725, filed
November 4, 1998.

BACKGROUND 
Field of the Invention 

[0002] This invention relates to multipliers and multiplication methods capable of
multiplying large multiplicands and performing multiple parallel multiplications of small
multiplicands.

Description of Related Art 

[0003] A multiplier is often one of the largest circuit units in a microprocessor or a digital
signal processor (DSP). The size of a multiplier can be a particular problem in video processing,
where high-performance processing often requires parallel multiplications. Additionally, video
processing often needs to multiply both relatively small multiplicands (e.g., 8-bit time domain
pixel data) and larger multiplicands (e.g., 16-bit frequency domain data). A large multiplier
designed for the larger multiplicands can multiply a pair of the smaller multiplicands, but
providing a large number of large multipliers requires a large amount of circuit area and increases
the manufacturing cost of an integrated circuit containing on-chip multipliers. Having two sets of
multipliers, one set including a large number of smaller multipliers for small multiplicands and a
second set containing a smaller number of larger multipliers for larger multiplicands, also
requires a large circuit area without a corresponding increase in performance, since the smaller
multipliers generally cannot be used when multiplying larger multiplicands.

[0004] A processor or multiplier architecture is desired that requires a minimum circuit area,
can multiply larger multiplicands, and can perform multiple parallel multiplications of small
multiplicands.

SUMMARY 

[0005] In accordance with an aspect of the invention, a multiplier circuit includes a plurality
of multipliers. The multipliers are capable of operating separately for parallel multiplications of
multiplicands having a small data width or operating cooperatively for multiplications of
multiplicands having a larger data width.

[0006] In one embodiment of the invention, a multiply unit includes one or more sets of four
multipliers and one or more adders, each of which combines results from an associated set of
multipliers. The multipliers in a set when operating independently generate four products, for
example, four products of 8-bit values. When four multipliers operate cooperatively with the
associated adder, the adder combines the results from the four multipliers to generate a product of
two double-width operands, for example, the product of two 16-bit operands. To combine the results
from the multipliers, the adder has input ports that are larger than the output ports of the
multipliers, and the output port of each multiplier is coupled to bits within an input port of the
adder according to the significance of the product determined by the multiplier. An output circuit
for the multiply unit provides output signals from the multipliers when the multiply unit operates
in a first mode (e.g., for parallel multiplications of single-width multiplicands), and provides an
output signal from the adder when the multiply unit operates in a second mode (e.g., for
multiplication of double-width multiplicands). The multiply unit further includes an
operand selection circuit that selects different portions of input operands for each multiplier. The
portions selected for a multiplier typically depend on the processor's operating mode.

[0007] In accordance with another embodiment of the invention, a multiply unit includes a
first multiplier, a second multiplier, a third multiplier, and a fourth multiplier coupled to an
adder. The first multiplier is connected so that a least significant bit output from the first
multiplier corresponds to a least significant bit in the adder. The second and third multipliers are
connected so that a least significant bit output from the second multiplier and a least significant
bit output from the third multiplier correspond to a first bit that is more significant than the least
significant bit of the adder. The fourth multiplier is connected so that a least significant bit
output from the fourth multiplier corresponds to a second bit that is more significant than the first
bit in the adder. An output circuit provides output signals from the multipliers when the
multiply unit operates in a first mode, and provides an output signal from the adder when the
multiply unit operates in a second mode.

[0008] To control timing, latch circuits between the multipliers and the adder can register the
output signals of the multipliers so that the multipliers perform multiplication operations during a
first clock cycle and the adder combines the output signals during a second clock cycle. The
adder can thus be in another portion of the circuit such as in an arithmetic logic unit, where the
adder performs normal addition. Alternatively, the adder and the multipliers can operate during
the same clock cycle.

[0009] Generally, the multiply unit further includes operand selection logic coupled to the
multipliers. In the first mode, the selection logic provides a pair of single-width multiplicands to
each multiplier for multiplication. In the second mode, the operand selection logic separates a
first double-width multiplicand into a first partial multiplicand and a second partial multiplicand,
separates a second double-width multiplicand into a third partial multiplicand and a fourth partial
multiplicand, and provides the first and third partial multiplicands to the first multiplier, the
first and fourth partial multiplicands to the second multiplier, the second and third partial
multiplicands to the third multiplier, and the second and fourth partial multiplicands to the
fourth multiplier for multiplication. For signed multiplicands,
two's complement units can provide the first and second double-width multiplicands
representing absolute values of the respective signed input values, and sign correction circuits
associated with the multipliers can correct the sign of the output signals from the multipliers.

[0010] In accordance with another embodiment of the invention, a method for operating a
multiply unit containing a plurality of multipliers includes: operating the multipliers separately
to generate a plurality of output product values when the multiply unit operates in a first mode;
and combining product values from the multipliers to generate only a single output product value
when the multiply unit operates in a second mode. The output product values when the multiply
unit operates in the first mode have a first data width that is about one fourth of a data width that
the single output product value has when the multiply unit operates in the second mode.



BRIEF DESCRIPTION OF THE DRAWINGS 

[0011] Fig. 1 is a block diagram of a multiply unit in accordance with an embodiment of the 
invention. 

[0012] Fig. 2 is a block diagram of one of four small multiply units used in the multiply unit 
of Fig. 1. 

[0013] Fig. 3 is a block diagram of a multiply unit in accordance with another embodiment 
of the invention. 

[0014] Fig. 4 is a block diagram of one of four small multiply units used in the multiply unit 
of Fig. 3. 

[0015] Use of the same reference symbols in different figures indicates similar or identical 
items. 

DETAILED DESCRIPTION 

[0016] In accordance with an aspect of the invention, a processor has an architecture that
efficiently performs video data processing such as motion searches, horizontal filtering, vertical
filtering, and half-pixel interpolation and performs general-purpose processing for general
control of video, audio, and modem data processing. The processor is operable in different
modes for different types of processing. The architecture provides multiple data path slices for
parallel processing of pixel values during video processing modes and cooperative processing for
a wider data path during a general processing mode. In particular, separate slices in a multiply
unit perform multiple parallel multiplications for pixel processing or motion estimation and
cooperative operations for general-purpose processing.

[0017] Fig. 1 illustrates a multiply unit 100 in accordance with an embodiment of the
invention. Multiply unit 100 would typically be used in a processor, DSP, or other integrated
circuit that performs arithmetic operations. U.S. Pat. App. No. 09/186,725, which is hereby
incorporated by reference in its entirety, further describes a video processor that contains two
multiply units of a type such as illustrated in Fig. 1.

[0018] Multiply unit 100 includes two's complement units 110A and 110B, extension logic
115, multiplicand selection logic 120, four multiplier slices 130, 131, 132, and 133, an adder
140, a latch 150, and output selection circuits 152, 154, and 156. Each of the multiplier slices
130, 131, 132, and 133 contains a multiplier that operates in parallel with the multipliers in the
other slices. Depending on the operating mode of multiply unit 100, the multipliers in slices 130,
131, 132, and 133 operate either independently to produce four separate products or
cooperatively to perform one multiplication of larger multiplicands. In the exemplary
embodiment, multiply unit 100 performs either four parallel multiplications of signed or unsigned
8-bit values or one multiplication of 16-bit values, and each multiplier is a 9x9-bit signed
multiplier. 8x8-bit multipliers are sufficient in an embodiment requiring the capability to
multiply only signed 8-bit operands or only unsigned 8-bit operands. More generally,
embodiments of the invention can be applied to any data width according to the values being
multiplied.

[0019] In the illustrated embodiment, multiply unit 100 receives two 32-bit signals INA and
INB, which are input to two's-complement units 110A and 110B, respectively. The
interpretation and processing of input signals INA and INB depend on the operating mode of
multiply unit 100.

[0020] In a first operating mode (e.g., pixel processing mode), each 32-bit input signal INA
or INB represents four 8-bit signed or unsigned operands, and extension logic 115 uses separate
bytes of signals INA and INB to construct operands A0, A1, A2, A3, B0, B1, B2, and B3. In
particular, for signed multiplications, extension logic 115 sign extends each 8-bit operand to
provide 9-bit signed multiplicands A0, A1, A2, A3, B0, B1, B2, and B3. For unsigned
multiplications, extension logic 115 adds a ninth bit having value zero to each 8-bit operand to
create 9-bit positive multiplicands A0, A1, A2, A3, B0, B1, B2, and B3. Multiplicand selection
logic 120 then selects pairs of 9-bit multiplicands (A0, B0), (A1, B1), (A2, B2), and (A3, B3) for
slices 130, 131, 132, and 133, respectively.
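
The first-mode operand handling can be illustrated with a short behavioral model. The following Python sketch is not part of the described circuit; it assumes that operand A0 is taken from the least significant byte of signal INA (the byte ordering is not stated above), and the function names are hypothetical.

```python
def extend_to_9bit(byte, signed):
    """Return the integer value an 8-bit operand has after extension to nine bits."""
    byte &= 0xFF
    if signed and (byte & 0x80):
        return byte - 0x100  # sign extension: the ninth bit copies bit 7
    return byte              # zero extension: the ninth bit is 0


def split_bytes(word):
    """Split a 32-bit word into four bytes, least significant first (assumed order)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def parallel_multiply(ina, inb, signed):
    """Model four independent 9x9-bit signed multiplications A0*B0 ... A3*B3."""
    a = [extend_to_9bit(x, signed) for x in split_bytes(ina)]
    b = [extend_to_9bit(x, signed) for x in split_bytes(inb)]
    return [x * y for x, y in zip(a, b)]


# Bytes 0x02 and 0xFF multiply to -2 when signed and to 510 when unsigned.
print(parallel_multiply(0x01FF0203, 0x0102FF04, signed=True))   # [12, -2, -2, 1]
print(parallel_multiply(0x01FF0203, 0x0102FF04, signed=False))  # [12, 510, 510, 1]
```

Because the ninth bit carries the sign for signed operands and is zero for unsigned operands, the same signed 9x9-bit multiplication serves both cases.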

[0021] In a second operating mode (e.g., a general processing mode), each signal INA and
INB represents a 16-bit signed value, and two's complement units 110A and 110B perform a
two's complement on any negative signed values to generate positive 16-bit values. The sign
information that two's-complement units 110A and 110B determine from signals INA and INB
is used in determining the sign of a final product as described further below. Extension logic 115
breaks the two 16-bit positive values into four 8-bit values and adds a ninth bit having value zero
to each 8-bit value to generate partial multiplicands A0, A1, B0, and B1. In the second mode,
multiplicand selection logic 120 then selects pairs of 9-bit multiplicands (A0, B0), (A1, B0),
(A0, B1), and (A1, B1) for slices 130, 131, 132, and 133, respectively.

[0022] Fig. 2 is a block diagram of slice 130, which includes a signed 9x9-bit multiplier 220,
an adder 230, a rounding register 235, a clamp 240, an accumulator 250, a multiplexer 260, and a
shifter 270. Slices 131, 132, and 133 have the same structure as slice 130. In slice 130,
multiplier 220 performs a signed multiplication of two 9-bit signed multiplicands A0 and B0.
The resulting product from multiplier 220 is nominally a 17-bit signed value but actually only
requires at most 16 bits to express since the 9-bit signed values were extended from 8-bit signed
or unsigned values. The sign bit of the 17-bit product is stripped off to provide signal TERM,
which is a 16-bit signal representing the product of two unsigned 8-bit values. As described
further below, signal TERM is for combination with similar product signals from the other
multipliers when multiply unit 100 operates in the second mode to multiply 16-bit multiplicands.

[0023] The data path to adder 230 is for separate multiply, multiply-and-accumulate
operations, and filtering operations that multiply unit 100 performs on 8-bit values. Two sign
bits can be stripped off the 17-bit product signal without loss of information in this data path.
Additionally, the four least significant bits are ignored in the exemplary embodiment, which limits
a result signal OUT8 to eight bits. As a result, in the exemplary embodiment, adder 230 receives
an 11-bit product signal from multiplier 220.

[0024] Adder 230 adds a value from a register 235 and/or a value from shifter 270 to the
product from multiplier 220. Register 235 stores a value that selects a rounding mode that
applies if the sum from adder 230 is right shifted, for example, for rounding down or up after a
divide by two. Shifter 270 provides to adder 230 a value that is either zero or derived from the
content of accumulator 250. For a simple multiplication, a multiplexer 260 provides a
zero-valued data signal to shifter 270, and shifter 270 provides a zero-valued addend to adder 230.
For a multiply-and-accumulate operation, multiplexer 260 selects the value from accumulator
250. Shifter 270 can either shift the accumulated value from accumulator 250 or leave the
accumulated value unchanged. For normal multiply-and-accumulate operations, adder 230
receives and adds the unchanged accumulated value to the product output from multiplier 220.
For filter operations, shifter 270 shifts the accumulated value according to a desired weighting
between the product and the accumulated value.

[0025] A clamp circuit 240 selects eight output bits from the operation performed in slice
130 and handles overflow situations by clamping the sum from adder 230 as the operation
requires. A result signal OUT8[7:0] from clamp circuit 240 of slice 130 represents a clamped
product of two signed or unsigned values A0 and B0.

[0026] In general processing mode, multiply unit 100 performs a multiplication of two 16-bit
operands derived from signals INA and INB. Two's complement units 110A and 110B provide
positive 16-bit values in two operands A and B and separately provide two sign bits for selecting
the signs of resulting products of signed multiplications. For signed multiplication,
two's-complement units 110A and 110B determine the two's complements of any negative 16-bit
values in the original operands INA and INB and determine the sign bits accordingly. For
unsigned multiplication, the 32-bit signals INA and INB are simply truncated to sixteen bits.

[0027] Multiply unit 100 performs a 16x16-bit multiplication to generate a 32-bit output
OUT32. Specifically, slices 130 to 133 multiply a 16-bit value including bytes A0 and A1 of
operand A by a 16-bit value including bytes B0 and B1 of operand B. Multiply unit 100 ignores
bytes A2, A3, B2, and B3 in general processing mode.

[0028] In Fig. 1, slices 130 to 133 operate cooperatively for multiplication of two 16-bit
positive values. In particular, multipliers 220 in slices 130, 131, 132, and 133 respectively
determine products A0*B0, A1*B0, A0*B1, and A1*B1. The products are 16-bit values that are
portions of 32-bit values input to an adder 140. Product A0*B0 provides 16 bits aligned on the
right with bit 0 of adder 140. Products A0*B1 and A1*B0 are aligned on the right with bit 8 of
adder 140, and product A1*B1 is aligned on the right with bit 16 of adder 140. The sum from
adder 140, which is a 32-bit value representing the product of positive 16-bit values, is held in a
latch 150. For signed multiplications, a multiplexer 154 selects the positive product from latch
150 or a one's-complement value of the product according to sign bits from two's-complement
units 110A and 110B. An inverter 152 inverts each bit in the positive product to generate the
one's-complement value. An arithmetic logic unit (ALU) completes the multiplication by
adding one to the result, thereby completing a two's complement for negative products. The
ALU can simultaneously add a further value from an accumulator (not shown) for
multiply-and-accumulate operations.
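
The alignment of the four partial products and the sign correction described above can be checked with a behavioral model. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, the operands are passed as 16-bit patterns, and the final increment that the text assigns to the ALU is folded into the same function.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def multiply16_from_slices(a, b, signed=True):
    """Return the 32-bit pattern of a*b built from four 8x8-bit partial products."""
    a &= MASK16
    b &= MASK16
    sign_a = signed and bool(a & 0x8000)
    sign_b = signed and bool(b & 0x8000)
    # Two's complement units: replace negative inputs with their magnitudes.
    mag_a = (-a) & MASK16 if sign_a else a
    mag_b = (-b) & MASK16 if sign_b else b
    a0, a1 = mag_a & 0xFF, mag_a >> 8
    b0, b1 = mag_b & 0xFF, mag_b >> 8
    # Adder: A0*B0 at bit 0, A1*B0 and A0*B1 at bit 8, A1*B1 at bit 16.
    total = (a0 * b0) + ((a1 * b0) << 8) + ((a0 * b1) << 8) + ((a1 * b1) << 16)
    total &= MASK32
    if sign_a != sign_b:
        ones_complement = ~total & MASK32       # inverter: flip every bit
        total = (ones_complement + 1) & MASK32  # adding one completes the negation
    return total


def to_signed32(value):
    """Interpret a 32-bit pattern as a signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value


print(to_signed32(multiply16_from_slices(-1234 & MASK16, 567)))  # -699678
```

The weights 0, 8, 8, and 16 follow from writing A = A1*2^8 + A0 and B = B1*2^8 + B0 and expanding the product A*B.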

[0029] Fig. 3 illustrates a multiply unit 300, which includes four slices 330, 331, 332, and
333 in accordance with an alternative embodiment of the invention. In general processing mode,
slices 330, 331, 332, and 333 cooperate to perform one 16x16-bit multiplication. For the
16x16-bit multiplication, two's complement units 110A and 110B perform a two's complement on any
negative 16-bit multiplicands and provide two positive 16-bit multiplicands A and B to the
portion of multiplier 100 shown in Fig. 3. Two's complement units 110A and 110B also provide
sign bits SIGNA and SIGNB indicating the signs of the respective input values INA and INB of
two's complement units 110A and 110B, and an XOR operation on sign bits SIGNA and SIGNB
indicates the sign SIGNO of the final product.

[0030] For the 16x16-bit multiplication, extension logic 115 and multiplicand selection logic
120 generate two 9-bit multiplicands A0 and A1 from 16-bit operand A and generate two 9-bit
multiplicands B0 and B1 from 16-bit operand B. Slices 330, 331, 332, and 333 respectively
receive operand pairs (A0, B0), (A1, B0), (A0, B1), and (A1, B1) for multiplications.

[0031] Slices 330, 331, 332, and 333, which perform four multiplications in parallel, contain
similar or identical components, and Fig. 4 illustrates the structure of slice 330 as an example.
Slice 330 includes signed 9x9-bit multiplier 220, adder 230, clamp circuit 240, accumulator 206,
multiplexer 207, and shifter 260 that operate in the manner described above in reference to slice
130 (Fig. 2). Slice 330 differs from slice 130 in that slice 330 includes a two's complement unit
410 and a multiplexer 420 connected to the output of multiplier 220. Two's complement unit
410 performs a two's complement on a 16-bit product output from multiplier 220 as a result of
multiplying two unsigned 8-bit multiplicands (e.g., A0 and B0). Multiplexer 420 then selects
either the negative value from two's complement unit 410 or the positive value from multiplier
220 depending on the desired sign SIGNO of the final output product OUT32.

[0032] Returning to Fig. 3, latch circuits 350, 351, 352, and 353 register respective output
signals ROUT0, ROUT1, ROUT2, and ROUT3 from respective slices 330, 331, 332, and 333 at
the end of a first clock cycle of a 16x16-bit operation. Latch circuits 350, 351, 352, and 353 can
also expand output signals ROUT0, ROUT1, ROUT2, and ROUT3, which are 16-bit signed
values, to 40-bit values TERM0, TERM1, TERM2, and TERM3. The expansion places each
signal ROUT0, ROUT1, ROUT2, and ROUT3 in the appropriate bit position for addition by a
40-bit adder 340. In particular, latch circuit 350 sign extends output signal ROUT0 to 40 bits.
Latch circuits 351 and 352 add eight zero-valued bits to the right of respective signals ROUT1
and ROUT2 and sign extend each resulting 24-bit value to 40 bits. Latch circuit 353 adds
sixteen zero-valued bits to the right of signal ROUT3 and sign extends the resulting 32-bit value
to 40 bits.
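
The expansion of the 16-bit slice outputs into 40-bit terms can be modeled as follows. This Python sketch is an illustration rather than the described circuit: the function names are hypothetical, and, as an assumption borrowed from the Verilog code of Table 1, sign extension is applied only when the final product is negative, with zero extension otherwise.

```python
MASK40 = (1 << 40) - 1


def expand_term(rout, shift, negative):
    """Place a 16-bit slice output at bit position `shift` of a 40-bit term."""
    rout &= 0xFFFF
    value = rout << shift  # zero-valued bits are added on the right
    if negative and (rout & 0x8000):
        # Sign extend: set every bit above the 16-bit field.
        value |= MASK40 & ~((1 << (shift + 16)) - 1)
    return value & MASK40


def combine_terms(routs, negative, acc=0):
    """Model the 40-bit adder: sum the four expanded terms plus an accumulator value."""
    shifts = (0, 8, 8, 16)  # A0*B0, A1*B0, A0*B1, A1*B1
    total = acc
    for rout, shift in zip(routs, shifts):
        total += expand_term(rout, shift, negative)
    return total & MASK40


# 258 * (-3): the magnitudes are A = 0x0102 and B = 0x0003, so the partial
# products are 6, 3, 0, and 0, and each slice negates its own product.
routs = [(-6) & 0xFFFF, (-3) & 0xFFFF, 0, 0]
print(combine_terms(routs, negative=True) == (-774) & MASK40)  # True
```

The example operands are chosen so that every negated partial product fits in a signed 16-bit value, which the 16-bit signals ROUT0 to ROUT3 require.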

[0033] For a 16x16-bit multiply operation, adder 340 adds the values of the four signals
TERM0, TERM1, TERM2, and TERM3 during a second clock cycle of the operation. For a
16x16-bit multiply-and-accumulate operation, adder 340 adds the values of the four signals
TERM0, TERM1, TERM2, and TERM3 to a 40-bit value that a 40-bit accumulator 344 provides
via a multiplexer 342. The multiplier embodiment of Fig. 3 thus uses slices 330, 331, 332, and
333 to perform multiplications during a first clock cycle and adder 340 to combine the results of
the multiplications during a second clock cycle. In contrast, for 16x16-bit multiplications in the
multiplier embodiment of Fig. 1, slices 130, 131, 132, and 133 and adder 140 perform
multiplications and additions in the same clock cycle. Accordingly, the timing required in the
embodiment of Fig. 3 may be easier to achieve. Additionally, in a processor, adder 340 can be
part of an arithmetic logic unit that performs general arithmetic operations not limited to multiply
operations associated with multiply unit 300.

[0034] Table 1 contains Verilog code for implementing an embodiment of multiply unit 300
of Fig. 3.

Table 1: Verilog Code for Multiply Unit Embodiment

module mac_rtl (reset, clk, ina, inb, out, mpy);

input         reset, clk, mpy;
input  [15:0] ina;
input  [15:0] inb;
output [39:0] out;

wire [39:0] addout;
wire [39:0] muxout;
reg  [39:0] accout;
wire [8:0]  ina0, ina1, ina2, ina3;
wire [8:0]  inb0, inb1, inb2, inb3;
wire [15:0] out0, out1, out2, out3;
wire [15:0] tout0, tout1, tout2, tout3;
reg  [15:0] rout0, rout1, rout2, rout3;
wire [31:0] tempout;
wire [15:0] tina, tinb;
wire        signa, signb, signo;
reg         rsigno;

// This will be previous stage calculation - two's complement block
assign tina = ina[15] ? (~ina[15:0] + 1) : ina[15:0];
assign tinb = inb[15] ? (~inb[15:0] + 1) : inb[15:0];

// assign the sign bits
assign signa = ina[15];
assign signb = inb[15];
assign signo = ina[15] ^ inb[15];

assign ina0 = {1'b0, tina[7:0]};
assign inb0 = {1'b0, tinb[7:0]};
assign ina1 = {1'b0, tina[7:0]};
assign inb1 = {1'b0, tinb[15:8]};
assign ina2 = {1'b0, tina[15:8]};
assign inb2 = {1'b0, tinb[7:0]};
assign ina3 = {1'b0, tina[15:8]};
assign inb3 = {1'b0, tinb[15:8]};

mult9 mul0 (.ina(ina0), .inb(inb0), .out(out0));
mult9 mul1 (.ina(ina1), .inb(inb1), .out(out1));
mult9 mul2 (.ina(ina2), .inb(inb2), .out(out2));
mult9 mul3 (.ina(ina3), .inb(inb3), .out(out3));

assign tout0 = signo ? (~out0 + 1) : out0;
assign tout1 = signo ? (~out1 + 1) : out1;
assign tout2 = signo ? (~out2 + 1) : out2;
assign tout3 = signo ? (~out3 + 1) : out3;

always @(posedge clk) begin
    if (reset) begin
        rout0 <= #1 16'b0;
        rout1 <= #1 16'b0;
        rout2 <= #1 16'b0;
        rout3 <= #1 16'b0;
    end else begin
        rout0 <= #1 tout0;
        rout1 <= #1 tout1;
        rout2 <= #1 tout2;
        rout3 <= #1 tout3;
    end
end

// pipeline sign signal
always @(posedge clk) begin
    if (reset) rsigno <= #1 1'b0;
    else       rsigno <= #1 signo;
end

// mux description
assign muxout = mpy ? 40'h0 : accout;

// adder description
wire [39:0] term3 = rsigno ? {{8{rout3[15]}}, rout3, 16'h0} :
                             {8'h0, rout3, 16'h0};
wire [39:0] term2 = rsigno ? {{16{rout2[15]}}, rout2, 8'h0} :
                             {16'h0, rout2, 8'h0};
wire [39:0] term1 = rsigno ? {{16{rout1[15]}}, rout1, 8'h0} :
                             {16'h0, rout1, 8'h0};
wire [39:0] term0 = rsigno ? {{24{rout0[15]}}, rout0} : {24'h0, rout0};

assign addout = term3 + term2 + term1 + term0 + muxout;

// accumulator description
always @(posedge clk) begin
    if (reset) accout <= #1 40'h0;
    else       accout <= #1 addout;
end

assign out = addout;

endmodule



[0035] Although the invention has been described with reference to particular embodiments, 
the description is only an example of the invention's application and should not be taken as a 
limitation. In particular, the data widths described herein are merely examples in particular 
embodiments for the invention, but embodiments of the invention can be implemented for data 
widths other than the examples described here. Various other adaptations and combinations of 
features of the embodiments disclosed are within the scope of the invention as defined by the 
following claims. 